Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions

We consider the standard model of distributed optimization of a sum of functions F(z)=∑i=1nfi(z), where node i in a network holds the function fi(z). We allow for a harsh network model characterized by asynchronous updates, message delays, unpredictable message losses, and directed communication among nodes. In this setting, we analyze a modification of the Gradient-Push method for distributed optimization, assuming that (i) node i is capable of generating gradients of its function fi(z) corrupted by zero-mean bounded–support additive noise at each step, (ii) F(z) is strongly convex, and (iii) each fi(z) has Lipschitz gradients. We show that our proposed method asymptotically performs as well as the best bounds on centralized gradient descent that takes steps in the direction of the sum of the noisy gradients of all the functions f1(z), …, fn(z) at each step.


Introduction
Distributed systems have attracted much attention in recent years due to their many applications such as large scale machine learning (e.g., in the healthcare domain, Brisimi et al., 2018), control (e.g., maneuvering of autonomous vehicles, Peng et al., 2017), sensor networks (e.g., coverage control, He et al., 2015) and advantages over centralized systems, such as scalability and robustness to faults. In a network comprised of multiple agents (e.g., data centers, sensors, vehicles, smart phones, or various IoT devices) engaged in data collection, it is sometimes impractical to collect all the information in one place. Consequently, distributed optimization techniques are currently being explored for potential use in a variety of estimation and learning problems over networks. This paper considers the separable optimization problem min z ∈ ℝ d F (z) ≔ ∑ i = 1 n f i (z), (1) where the function f i : ℝ d ℝ is held only by agent i in the network. We assume the agents communicate through a directed communication network, with each agent able to send messages to its out-neighbors. The agents seek to collaboratively agree on a minimizer to the global function F(z).
This fairly simple problem formulation is capable of capturing a variety of scenarios in estimation and learning. Informally, z is often taken to parameterize a model, and f i (z) is a loss function measuring how well z matches the data held by agent i. Agreeing on a minimizer of F(z) means agreeing on a model that best explains all the data throughout the network -and the challenge is to do this in a distributed manner, avoiding techniques such as flooding which requires every node to learn and store all the data throughout the network. For more details, we refer the reader to the recent survey by Nedic et al. (2018).
In this work, we will consider a fairly harsh network environment, including message losses, delays, asynchronous updates, and directed communication. The function F(z) will be assumed to be strongly convex with the individual functions f i (z) having a Lipschitz continuous gradient. We will also assume that, at every time step, node i can obtain a noisy gradient of its function f i (z). Our goal will be to investigate to what extent distributed methods can remain competitive with their centralized counterparts in spite of these obstacles.

Literature Review
Research on models of distributed optimization dates back to the 1980s, see Tsitsiklis et al. (1986). The separable model of (1) was first formally analyzed in Nedic and Ozdaglar (2009), where performance guarantees on a fixed-stepsize subgradient method were obtained. The literature on the subject has exploded since, and we review here only the papers closely related to our work. We begin by discussing works that have focused on the effect of harsh network conditions. A number of recent papers have studied asynchronicity in the context of distributed optimization. It has been noted that asynchronous algorithms are often preferred to synchronous ones, due to the difficulty of perfectly coordinating all the agents in the network, e.g., due to clock drift. Papers by Recht et al. (2011);Li et al. (2014); Agarwal and Duchi (2011);Lian et al. (2015) and Feyzmahdavian et al. (2016) study asynchronous parallel optimization methods in which different processors have access to a shared memory or parameter server. Recht et al. (2011) present a scheme called HOGWILD!, in which processors have access to the same shared memory with the possibility of overwriting each other's work. Li et al. (2014) proposes a parameter server framework for distributed machine learning. Agarwal and Duchi (2011) analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information such schemes have been shown to achieve superior performance in recent years (see Shi et al., 2015;Sun et al., 2016;Oreshkin et al., 2010;Nedic et al., 2017;Xi and Khan, 2017a;Xi et al., 2018;Qu and Li, 2017;Xu et al., 2015;Qu and Li, 2019;Di Lorenzo and Scutari, 2016); we refer the reader to Pu and Nedic (2018) which took this approach.
One of our main concerns in this paper is to develop decentralized optimization methods which perform as well as their centralized counterparts. Specifically, we will compare the performance of a distributed method for (1) on a network of n nodes with the performance of a centralized method which, at every step, can query all n gradients of the functions f 1 (z), …, f n (z). Since the distributed algorithm gets noise-corrupted gradients, so should the centralized method. Thus, the natural approach is to compare the distributed method to centralized gradient descent which moves in the direction of the sum of the gradients of f 1 (z), …, f n (z).This method of comparison keeps the "computational power" of the two nodes identical.
Traditionally, the bounds derived on distributed methods were considerably worse than those derived for centralized methods. For example, the papers by Nedic and Olshevsky (2015Olshevsky ( , 2016 had bounds for distributed optimization over directed graphs that were worse than the comparable centralized method (in terms of rate of error decay) by a multiplicative factor that, in the worst case, could be as large as n O(n) . This is typical over directed graphs, though better results are possible over undirected graphs. For example, in Olshevsky (2017), in the model of noiseless, undelayed, synchronous communication over an undirected graph, a distributed subgradient method was proposed whose performance, relative to a centralized method with the same computational power, was worse by a multiplicative factor of n.
The breakthrough papers by Chen and Sayed (2015); Pu and Garcia (2017); Morral et al. (2017), were the first to address this gap. These papers studied the model where gradients are corrupted by noise, which we also consider in this paper. Chen and Sayed (2015) examined the mean-squared stability and convergence of distributed strategies with fixed step-size over graphs and showed the same performance level as that of a centralized strategy, in the small step-size regime. In Pu and Garcia (2017) it was shown that, for a certain stochastic differential equation paralleling network gradient descent, the performance of centralized and distributed methods were comparable. In Morral et al. (2017), it was proved, for the first time, that distributed gradient descent with an appropriately chosen step-size, asymptotically performs similarly to a centralized method that takes steps in the direction of the sum of the noisy gradients, assuming iterates will remain bounded almost surely. This was the first analysis of a decentralized method for computing the optimal solution with performance bounds matching its centralized counterpart.
Both Pu and Garcia (2017) and Morral et al. (2017) were over fixed, undirected graphs with no message loss or delays or asynchronicity. As shown in the paper by Morral et al. (2012), this turns out to be a natural consequence of the analysis of those methods. Indeed, on a technical level, the advantage of working over undirected graphs is that they allow for easy distributed multiplication by doubly-stochastic matrices; it was shown in Morral et al. (2012) that if this property holds only in expectation -that is, if the network nodes can multiply by random stochastic matrices that are only doubly stochastic in expectation -distributed gradient descent will not perform comparably to its centralized counterpart.
In parallel to this work, and in order to reduce communication bottlenecks, Koloskova et al. (2019) propose a decentralized SGD with communication compression that can achieve the centralized baseline convergence rate, up to a constant factor. When the objective functions are smooth but not necessarily convex, Lian et al. (2017) show that Decentralized Parallel Stochastic Gradient Descent (D-PSGD) can asymptotically perform comparably to Centralized PSGD in total computational complexity. However, they argue that D-PSGD requires much less communication cost on the busiest node and hence, can outperform C-PSGD in certain communication regimes. Again, both Koloskova et al. (2019) and Lian et al. (2017) are over fixed undirected graphs, without delays, link failures or asynchronicity. The follow-up work by Lian et al. (2018), extends the D-PSGD to the asynchronous case.

Our Contribution
We propose an algorithm which we call Robust Asynchronous Stochastic Gradient Push (RASGP) for distributed optimization from noisy gradient samples over directed graphs with message losses, delays, and asynchronous updates. We will assume gradients are corrupted with additive noise represented by independent random variables, with bounded support, and with finite variance at node i denoted by σ i 2 . Our main result is that the RASGP performs as well as the best bounds on centralized gradient descent that moves in the direction of the sum of noisy gradients of f 1 (z), …, f n (z). Our results also hold if the underlying graphs are time-varying as long as there are no message losses. We give a brief technical overview of this result next.
We will assume that each function f i (z) is μ i -strongly convex with L i -Lipschitz gradient, where ∑ i μ i and L i > 0, i = 1, …, n. The RASGP will have every node maintain an estimate of the optimal solution which will be updated from iteration to iteration; we will use z i (k) to denote the value of this estimate held by node i at iteration k. We will show that, for each node i = 1, …, n, where z* ≔ arg min F(z) and Γ u is the degree of asynchronicity, defined as the maximum number of iterations between two consecutive updates of any agent. The leading term matches the best bounds for (centralized) gradient descent that takes steps in the direction of the sum of the noisy gradients of f 1 (z), …, f n (z), every k/Γ u iterations (see Nemirovski et al., 2009;Rakhlin et al., 2012). Asymptotically, the performance of the RASGP is network independent: indeed, the only effect of the network or the number of nodes is on the constant factor within the O k (1/k 1.5 ) term above. The asymptotic scaling as O k (1/k) is optimal in this setting (Rakhlin et al., 2012).
In other words, asymptotically we get the variance reduction of a centralized method that simply averages the n noisy gradients at each step.
The implication of this result is that one can get the benefit of having n independent processors computing noisy gradients in spite of all the usual problems associated with communications over a network (i.e., message losses, latency, asynchronous updates, oneway communication). Of course, the caveat is that one must wait sufficiently long for the asymptotic decay to "kick in," i.e., for the second term on the right-hand side of (2) to become negligible compared to the first. We leave the analysis of the size of this transient period to future work and note here that it will depend on the network and the number of The RASGP is a variation on the usual distributed gradient descent where nodes mix consensus steps with steps in the direction of their own gradient, combined with a new step-size trick to deal with asynchrony. It is presented as Algorithm 3 in Section 3. For a formal statement of the results presented above, we refer the reader to Theorem 15 in the body of the paper.
We briefly mention two caveats. The first is that implementation of the RASGP requires each node to use the quantity ∑ i = 1 n μ i /n in setting its local stepsize. This is not a problem in the setting when all functions are the same, but, otherwise, ∑ i = 1 n μ i /n is a global quantity not immediately available to each node. Assuming that node i knows μ i one possibility is to use average consensus to compute this quantity in a distributed manner before running the RASGP (for example using the algorithm described in Section 2 of this paper). The second caveat is that, like all algorithms based on the push-sum method, the RASGP requires each node to know its out-degree in the communication graph.

Organization of This Paper
We conclude this Introduction with Section 1.4, which describes the basic notation we will use throughout the remainder of the paper. Section 2 does not deal directly with the distributed optimization problem we have discussed, but rather introduces the problem of computing the average in the fairly harsh network setting we will consider in this paper. This is an intermediate problem we need to analyze on the way to our main result. Section 3 provides the RASGP algorithm for distributed optimization, and then states and proves our main result, namely the asymptotically network-independent and optimal convergence rate. Results from numerical simulations of our algorithm to illustrate its performance are provided in Section 4, followed by conclusions in Section 5.

Notations and Definitions
We assume there are n agents V = {1, ..., n}, communicating through a fixed directed graph G = (V, ℰ), where ℰ is the set of directed arcs. We assume G does not have self-loops and is strongly connected.
For a matrix A, we will use A ij to denote its (i,j)th entry. Similarly, v i and [v] i will denote the ith entry of a vector v. A matrix is called stochastic if it is non-negative and the sum of the elements of each row equals to one. A matrix is column stochastic if its transpose is stochastic. To a non-negative matrix A ∈ ℝ n × n we associate a directed graph G A with vertex set V A = {1, 2, ..., n} and edge set ℰ A = {(i, j)|A ji > 0}. In general, such a graph might contain self-loops. Intuitively, this graph corresponds to the information flow in the update x(k + 1) = Ax(k); indeed, (i, j) ∈ ℰ A if the jth coordinate of x(k + 1) depends on the ith coordinate of x(k) in this update.

Moreover, A k:k = A(k).
Node i is an in-neighbor of node j, if there is a directed link from i to j. Hence, j would be an out-neighbor of node i. We denote the set of in-neighbors and out-neighbors of node i by N i − and N i + , respectively. Moreover, we denote the number of in-neighbors and out-neighbors of node i with d i − and d i + , as its in-degree and out-degree, respectively. By x min and x max we denote min i x i and max i x i respectively, over all possible indices unless mentioned otherwise. We denote a n × 1 column vector of all ones or zeros by 1 n and 0 n , respectively. We will remove the subscript when the size is clear from the context.
Let v ∈ ℝ d be a vector. We denote by v − ∈ ℝ d a vector of the same length such that For all the algorithms we describe, we sometimes use the notion of mass to denote the value an agent holds, sends or receives. With that in mind, we can think of a value being sent from one node, as a mass being transferred.

Push-Sum with Delays and Link Failures
In this section we introduce the Robust Asynchronous Push-Sum algorithm (RAPS) for distributed average computation and prove its exponential convergence. Convergence results proved for this algorithm will be used later when we turn to distributed optimization. The algorithm relies heavily on ideas from Hadjicostis et al. (2016) to deal with message losses, delays, and asynchrony. The conference version of this paper Olshevsky et al. (2018) developed RAPS for the delay-free case, and this section may be viewed as an extension of that work.
Pseudocode for the algorithm is given in the box for Algorithm 1. We begin by outlining the operation of the algorithm. Our goal in this section is to compute the average of vectors, one held by each node in the network, in a distributed manner. However, since the RAPS algorithm acts separately in each component, we may, without loss of generality, assume that we want to average scalars rather than vectors. The scalar held by node i will be denoted by x i (0).
Without loss of generality, we define an iteration by descretizing time into time slots indexed by k = 0,1,2,…. We assume that during each time slot every agent makes at most one update and processes messages sent in previous time slots.
In the setting of no message losses, no delays, no asynchrony, and a fixed, regular, undirected communication graph, the RAPS can be shown to be equivalent to the much simpler iteration where W is an irreducible, doubly stochastic matrix with positive diagonal; standard Markov chain theory implies that x i (t) (1/n)∑ i = 1 n x i (t) in this setting. RAPS does essentially the same linear update, but with a considerable amount of modifications. In particular, we use the central idea of the classic push-sum method (Kempe et al., 2003) to deal with directed communication, which suggests to have a separate update equation for the y-variables, which informs us how we should rescale the x-variables; as well as the central idea of Hadjicostis et al. (2018), which is to repeatedly broadcast sums of previous messages to provide robustness against message loss. While the algorithm in Hadjicostis et al. (2018) handles message losses in a synchronous setting, RAPS can handle delays as well as asynchronicity.
Before getting into details, let us provide a simple intuition behind the RAPS algorithm. Each agent i holds a value (mass) x i and y i . At the beginning of every iteration, i wants to split its mass between itself and its out-neighbors j ∈ N i + . However, to handle message losses, it sends the accumulated x and y mass (running sums which we denote by ϕ i x and ϕ i y ), that i wants to transfer to each of its neighbors, from the start of the algorithm.
Therefore, when a neighbor j receives a new accumulated mass from i, it stores it at ρ ji mass that i has been trying to send since its last successful communication. Then, j updates its x and y mass by adding the new received masses, and finally, updates its estimate of the average to x/y. To handle delays and asynchronicity, timestamps κ i are attached to messages outgoing from i.
The pseudocode for the algorithm may appear complicated at first glance; this is because of the considerable complexity required to deal with directed communications, message losses, delays, and asynchrony.
We next describe the algorithm in more detail. First, in the course of executing the algorithm, every agent i maintains scalar variables x i , y i , z i , ϕ i x , ρ ij y and κ ij for (j, i) ∈ ℰ. The variables x i and y i have the same evolution, however y i is initialized as 1.
Therefore, to save space in describing and analyzing the algorithm, we will use the symbol θ, when a statement holds for both x and y. Similarly, when a statement is the same for both variables x and y, we will remove the superscripts x or y. For example, the initialization ρ ji (0) = 0 in the beginning of the algorithm means both ρ ji x (0) = 0 and ρ ji y (0) = 0.
We briefly mention the intuitive meaning of the various variables. The number z i represents node i's estimate of the initial average. The counter ϕ i θ (k) is the total θ-value sent by i to each of its neighbors from time 0 to k − 1. Similarly, ρ ij θ (k) is the total θ value that i has received from j up to time k − 1. The integer κ i is a timestamp that i attaches to its messages, and the number κ ij tracks the latest timestamp i has received from j.
To obtain an intuition for how the algorithm uses the counters ϕ i θ (k) and ρ ij θ (k), note that, in line 15 of the algorithm, node i effectively figures out the last θ value sent to it by each of its in-neighbors j, by looking at the increment to the ρ ij θ . This might seem needlessly involved, but, the underlying reason is that this approach introduces robustness to message losses.
2: At every iteration k = 0, 1, 2, …, for every node i: Node i broadcasts (ϕ i x , ϕ i y ,κ i ) to its out-neighbors in N i + .

9:
for (ϕ j x , ϕ j y , κ j ′) in the inbox do 10: if κ′ j > κ ij then 11:  Then, node i moves on to process the messages in its inbox in the following way. If agent i has received a message from node j that is newer than the last one it has received before, it will store that message in ρ ij * and discard the older messages. Next, i updates its x and y variables by adding the difference of ρ ij * with the older value ρ ij , for all in-neighbors j. As mentioned above, this difference is equal to the new mass received. Next, ρ ij * overwrites ρ ij in the penultimate step. The last step of the algorithm sets z i to be the rescaled version of x i : In the remainder of this section, we provide an analysis of the RAPS algorithm, ultimately showing that it converges geometrically to the average in the presence of message losses, asynchronous updates, delays, and directed communication. Our first step is to formulate the RAPS algorithm in terms of a linear update (i.e., a matrix multiplication), which we do in the next subsection.

Linear Formulation
Next we show that, after introducing some new auxiliary variables, Algorithm 1 can be written in terms of a classical push-sum algorithm (Kempe et al., 2003) on an augmented graph. Since the y-variables have the same evolution as the x-variables, here we only analyze the x-variables.
In our analysis, we will associate with each message an effective delay. If a message is sent at time k 1 and is ready to be processed at time k 2 , then k 2 − k 1 ≥ 1 is the effective delay experienced by that message. Those messages that are discarded will not have an effective delay associated with them and are considered as lost.
Next, we will state our assumptions on connectivity, asynchronicity, and message loss.

a.
Graph G is strongly connected and does not have self-loops.

b.
The delays on each link are bounded above by some Γ del ≥ 1.

c.
Every agent wakes up and performs updates at least once every Γ u ≥ 1 iterations.

d.
Each link fails at most Γ f ≥ 0 consecutive times.

e.
Messages arrive in the order of time they were sent. In other words, if messages are sent from node i to j at times k 1 and k 2 with (effective) delays d 1 and d 2 , respectively, and k 1 < k 2 , then we have k 1 + d 1 < k 2 + d 2 .
One consequence of Assumption 1 is that the effective delays associated with each message that gets through are bounded above by Γ d ≔ Γ del + Γ u − 1. Another consequence is that for each (i, j) ∈ ℰ, j receives a message from i successfully, at least once every Γ s iterations Part (e) of Assumption 1 can be assumed without loss of generality. Indeed, observe that outdated messages automatically get discarded in Line 10 of our algorithm. For simplicity, it is convenient to think of those messages as lost. Thus, if this assumption fails in practice, the algorithm will perform exactly as if it had actually held in practice due to Line 10. Making this an assumption, rather than a proposition, lets us slightly simplify some of the arguments and avoid some redundancy throughout this paper.
Let us introduce the following indicator variables: τ i (k) for i ∈ {1, …, n} which equals to 1 if node i wakes up at time k, and equals 0 otherwise. Similarly, τ ij l (k), for (i, j) ∈ ℰ, 1 ≤ l ≤ Γ d which is 1 if τ i (k) = 1 and the message sent from node i to j at time k will arrive after experiencing an effective delay of l. 2 Note that if node i wakes up at time k but the message it sends to j is lost, then τ ij l (k) will be zero for all l.
We can rewrite the RAPS algorithm with the help of these indicator variables. Let us adopt the notation that x i (k) refers to x i at the beginning of round k of the algorithm (i.e., before node i has a chance to go through the list of steps outlined in the algorithm box). We will use the same convention with all of the other variables, e.g., y i (k), z i (k), etc. If node i does not wake up at round k, then of course x i (k + 1) = x i (k). Now observe that we can write Likewise, we have which can be shown by considering each case (τ i (k) = 1 or 0); note that we have used the fact that, in the event that node i wakes up at time k, the variable ρ ij x (k + 1) equals the variable ρ ij * x during execution of Line 16 of the algorithm at time k.
Finally, we have that ∀(i, j) ∈ ℰ, the flows ρ ji x are updated as follows: where we make use of the fact that the sum contains only a single nonzero term, since the messages arrive monotonically. To parse the indices in this equation, note that node i actually broadcasts ϕ i x (k + 1 − l) in our notation at iteration k − l; by our definitions, ϕ i is the value of ϕ i x at the beginning of that iteration. To simplify these relations, we introduce the auxiliary variables u ij x for all (i, j) ∈ ℰ, defined through the following recurrence relation: and initialized as u ij x (0) ≔ 0. Intuitively, the variables u ij x represent the "excess mass" of x i that is yet to reach node j. Indeed, this quantity resets to zero whenever a message is sent that arrives at some point in the future, and otherwise is incremented by adding the broadcasted mass that is lost. Note that node i never knows u ij x (k), since it has no idea which messages are lost, and which are not; nevertheless, for purposes of analysis, nothing prevents us from considering these variables.
Let us also define the related quantity, and v ij x (k) ≔ 0 for k < 0. Intuitively, this quantity may be thought of as a forward-looking estimate of the mass that will arrive at node j, if the message sent from node i at time k gets through; correspondingly, it includes not only the previous unsent mass, but the extra mass that will be added at the current iteration.
The key variables for the analysis of our method are the variables we will denote by x ij l (k).
Intuitively, every time a message is sent, but gets lost, we imagine that it has instead arrived into a "virtual node" which holds that mass; once the next message gets through, we imagine that the virtual node has forwarded that mass to its intended destination. This idea originates from Hadjicostis et al. (2016). Because of the delays, however, we need to introduce Γ d virtual nodes for each such event. If a message is sent from i and arrives at j with effective delay l, we will instead imagine it is received by the virtual node b ij l , then sent to b ij l − 1 at the next time step, and so forth until it reaches b ij 1 , and is then forwarded to its destination. These virtual nodes are defined formally later.
Putting that intuition aside, we formally define the variables x ij l (k) via the following set of recurrence relations: and x ij l (k) ≔ 0 when both k ≤ 0 and l = 1, …, Γ d . To parse these equations, imagine what happens when a message is sent from i to j with effective delay of Γ d at time k. The content of this message becomes the value of x ij Γ d according to (8); and, in each subsequent step, , and so forth according to (9). Putting (8) and (9) together, we obtain x ij and particularly, x ij Note that, as is common in many of the equations we will write, only a single term in the sums can be nonzero (this is not obvious at this point and is a result of Lemma 1).
Before proceeding to the main result of this section, we state the following lemma, whose proof is immediate.
Lemma 1 If τ ij l (k) = 1, the following statements are satisfied: a.
Hence, by (10) we have, The next lemma is essentially a restatement of the observation that the content of every x ij l′ eventually "passes through" x ij 1 .
Lemma 4 For k = 0,1,… and (i, j) ∈ ℰ we have: u ij x (k + 1) + ρ ji Parsing these equations, (12) simply states that the value of x ij 1 (k) can be thought of as impacting ρ ji x at time k; recall that the content of x ij 1 (k) is a message that was sent from node i to j at time k − l with an effective delay of l, for some 1 ≤ l ≤ Γ d (cf. Equation 11). On the other hand, (13) may be thought of a "conservation of mass" equation. All the mass that has been sent out by node i has either: (i) been lost (in which case it is in u ij j (in which case it is in ρ ji x ), or (iii) is in the process of reaching node j but delayed (in which case it is in some x ij l ).
Although this lemma is arguably obvious, a formal proof is surprisingly lengthy. For this reason, we relegate it to the Appendix.
We next write down a matrix form of our updates. As a first step, define the (n + m′) × 1 column vector collecting y-values similarly.
Now, we have all the tools to show the linear evolution of χ(k). By Equations (4), (5) and (12) we have, Moreover, by the definitions of x ij , v ij and (4) it follows, Finally, by (4) and (7) we obtain, Using (14) to (16) we can write the evolution of χ(k) and ψ(k) in the following linear form: where M(k) ∈ ℝ (n + m′) × (n + m′) is an appropriately defined matrix.
We have thus completed half of our goal: we have shown how to write RAPS as a linear update. Next, we show that the corresponding matrices are column-stochastic.
Lemma 5 M(k) is column stochastic and its positive elements are at least 1/(max i {d i + } + 1).
This lemma can be proved "by inspection." Indeed, M(k) is column stochastic if and only if, for every χ(k), we have 1 T χ(k + 1) = 1 T χ(k). Thus one just needs to demonstrate that no mass is ever "lost," i.e., that a decrease/increase in the value of one node is always accompanied by an increase/decrease of the value of another node, which can be done just by inspecting the equations. A formal proof is nonetheless given next.
Proof To show that M(k) is column stochastic, we study how each element of χ(k) influences χ(k + 1).
For i = 1, …, n, the ith column of M(k) represents how x i (k) influences χ(k + 1). We will use (14) to (16) to find these coefficients.
First, x i (k) influences x i (k + 1) with the coefficient 1 − τ i (k) + τ i (k)/(d i For j ∈ N i + , x i (k) influences x ij l (k + 1) by τ ij l (k)/(d i + + 1) and u ij x (k + 1) with coefficient Summing these coefficients up results in 1.
which sum up to 1.
Note that all the coefficients above are at least 1/(max For further analysis, we augment the graph G to ℋ(k) ≔ G M(k) = (V A , ℰ A (k)) by adding the following virtual nodes: b ij l for l = 1, …, Γ d and (i, j) ∈ ℰ, which hold the values x ij l and y ij l ; We also add the nodes c ij for (i, j) ∈ ℰ which hold the values u ij x and u ij y .
In ℋ(k), there is a link from b ij l to b ij l − 1 for 1 < l ≤ d and from b ij 1 to j as they forward their values to the next node. Moreover, if τ ij l (k) = 1 for some 1 ≤ l ≤ Γ d , then there is a link from both c ij and i to b ij l .
If τ ij l (k) = 0 for 1 ≤ l ≤ Γ d then c ij has a self loop, and if also τ i (k) = 1, there's a link from i to c ij . All non-virtual agents i ∈ V, have self-loops all the time (see Fig. 2).
Recursions (17) and Lemma 5 may thus be interpreted as showing that the RAPS algorithm can be thought of as a push-sum algorithm over the augmented graph sequence {ℋ(k)}, where each agent (virtual and non-virtual) holds an x-value and a y-value which evolve similarly and in parallel.

Exponential Convergence
The main result of this section is exponential convergence of RAPS to initial average, stated next.
It is worth mentioning that even though 1/(1 − λ) = O(n p(n) ) where p(n) = O(n), this is a bound for a worst case scenario and on average, as it can be seen in numerical simulations, RAPS performs better. Moreover, when the graph G satisfies certain properties, such as regularity, and also there is no link delays and failures, we have 1/(1 − λ) = O(n 3 ) (see Theorem 1 in Nedic and Olshevsky, 2016). More broadly, that paper establishes that 1/(1 − λ) will scale with the mixing rate of the underlying Markov process.
Unfortunately, this theorem does not follow immediately from standard results on exponential convergence of push-sum. The reason is that the connectivity conditions assumed for such theorems are not satisfied here: there will not always be paths leading to virtual nodes from non-virtual nodes. Nevertheless, with some suitable modifications, the existence of paths from virtual nodes to other virtual nodes is sufficient, as we will show next.
Before proving the theorem, we need the following lemmas and definitions. Given a sequence of graphs G 0 , G 1 , G 2 , …, we will say node b is reachable from node a in time period k 1 to k 2 (k 1 < k 2 ), if there exists a sequence of directed edges e k 1 , e k 1 + 1 , …, e k 2 such that e k is in G k , the destination of e k is the origin of e k+1 for k 1 ≤ k < k 2 , and the origin of e k 1 is a and the destination of e k 2 is b.
Our first lemma provides a standard lower bound on the entries of the column-stochastic matrices from (17).
Lemma 7 M k+nΓ s −1:k has positive first n rows, for any k ≥ 0. The positive elements of this matrix are at least α = (1/n) nΓ s .
rows. Moreover, since all positive elements of M are at least 1/n, the positive elements of M k+nΓ s −1:k are at least (1/n) nΓ s . ■ Next, we give a reformulation of the push-sum update that will be key to showing the exponential convergence of the algorithm. The proof is a minor variation of Lemma 4 in Nedic and Olshevsky (2016).
Lemma 8 Consider the vectors u(k) ∈ ℝ d , v(k) ∈ ℝ + d , and square matrix A(k) ∈ ℝ + d × d , for k ≥ 0 such that, Also suppose u i (k) = 0 if v i (k) = 0, ∀k,i. Define u − (k) ∈ ℝ d as: where ∘ denotes the element-wise product of two vectors. Then we have, Proof Since u i (k) = 0 if v i (k) = 0, u i (k) = r i (k)v i (k) holds for all i, k. Substituting in (19) we obtain, Since, by definition r i (k) = 0 if v i (k) = 0, ∀k,i, we get Therefore, Our next corollary, which follows immediately from the previous lemma, characterizes the dichotomy inherent in push-sum with virtual nodes: every row either adds up to one or zero.
Corollary 9 Consider the matrix B(k) defined in Lemma 8. Let us define the index set J k ≔ {i|v i (k) ≠ 0}. If i ∉ J k , the ith column of B(k) and ith row of B(k − 1) only contain zero entries. Moreover, Hence, the ith row of B(k) sums to 1 if and only if v i (k + 1) ≠ 0 or i ∈ J k+1 .
Our next lemma characterizes the relationship between zero entries in the vectors χ(k) and ψ(k).
Proof First we note that ψ(0) = [1 n ⊺ , 0 m′ ⊺ ] ⊺ and each node i ∈ V has a self-loop in graph ℋ(k) for all k ≥ 0; hence, ψ h (k) ≥ 0 for all h and particularly, ψ i (k) > 0 for i = 1, …, n. Now suppose h > n and corresponds to a virtual agent a ℎ ∈ V A . If ψ h (k) = 0, it means a h has already sent all its y-value to another node or has not received any y-value yet. In either case, that node also has no remaining x-value as well and χ h (k) = 0. ■ Let us define ψ − (k) ∈ ℝ n + m′ , k ≥ 0 by Moreover, we define the vector z(k) by setting z(k) ≔ χ(k) ∘ ψ − (k). By (17) and Lemma 10, we can use Lemma 8 to obtain, z(k + 1) = P(k)z(k), where P(k) ≔ diag(ψ − (k + 1))M(k)diag(ψ(k)). Let us define Then, by Corollary 9 we have each z i (k + 1), i ∈ I k+1 , is a convex combination of z j (k)'s for j ∈ I k . Therefore, max i ∈ I k + 1 z i (k + 1) ≤ max i ∈ I k z i (k), min i ∈ I k + 1 z i (k + 1) ≥ min i ∈ I k z i (k) .
These equations will be key to the analysis of the algorithm. We stress that we have not shown that the quantity min i z i (k) is non-decreasing; rather, we have shown that the related quantity, where the minimum is taken over I k , the set of nonzero entries of ψ(k), is non-increasing.

Our next lemma provides lower and upper bounds on the entries of the vector ψ(k).
Lemma 11 For k ≥ 0 and 1 ≤ i ≤ n we have: nα ≤ ψ i (k) ≤ n .

Proof
For n + 1 ≤ h ≤ n + m′m, suppose ψ h corresponds to a virtual node a h corresponding to some link (i, j) ∈ ℰ. If ψ h (k) is positive, it is carrying a value sent from i at k − nΓ s or later, which has experienced link failure or delays. This is because each value gets to its destination after at most Γ s iterations. Since i has self-loops all the time, a h is reachable from i in period k − nΓ s to k − 1; Hence, M ℎi k − 1: k − nΓ s ≥ α, and it follows, ψ ℎ (k) ≥ αψ i (k − nΓ s ) ≥ nα 2 .
Next, we are able to find a lower bound on the positive elements of P(k). The proof of the following corollary is immediate.

b.
Positive entries of first n columns of P (k) are at least (1/n)α(nα) = α 2 . Similarly, the last m′ columns have positive entries of at least α 3 .

c.
For h > n, if h ∈ I k+nΓ s then P ℎi (k) > 0 for some 1 ≤ i ≤ n.
Our next lemma, which is the final result we need before proving the exponential convergence rate of RAPS, provides a quantitative bound for how multiplication by the matrix P shrinks the range of a vector.

Define s t (k) ≔ max
i ∈ I knΓ s + t u i (k) − min i ∈ I knΓ s + t u i (k) .
Proof Let us define r t (k) ≔ max 1 ≤ i ≤ n u i (k) − min 1 ≤ i ≤ n u i (k) .
Moreover, by a similar argument for j ≤ n,
We have: In the above, we used (18) and (24), the boundedness of ψ i (k), and the fact that ψ i (k) = 0 for i ∉ I k .

Remark:
Observe that our proof did not really use the initialization ψ(0) = 1, except to observe that the elements ψ(0) are positive, add up to n, and the implication that ψ(k) satisfies the bounds of Lemma 11. In particular, the same result would hold if we viewed time 1 as the initial point of the algorithm (so that ψ(1) is the initialization), or similarly any time k. We will use this observation in the next subsection.

Perturbed Push-Sum
In this subsection, we begin by introducing the Perturbed Robust Asynchronous Push-Sum algorithm, obtained by adding a perturbation to the x-values of (non-virtual) agents at the beginning of every iteration they wake up.
We show that, if the perturbations are bounded, the resulting z(k) nevertheless tracks the average of χ(k) pretty well. Such a result is a key step towards analyzing distributed optimization protocols. In this general approach to the analyses of distributed optimization methods, we follow Ram et al. (2010) where it was first adopted; see also Nedic and Olshevsky (2016) and Nedic and Olshevsky (2015) where it was used.
Adopting the notations introduced earlier and by the linear formulation (17) we have,
Define for k ≥ 1, We obtain Define z t (k) ≔ χ t (k) ∘ ψ − (k) for 0 ≤ t ≤ k (cf. Equation 20). We have We may view each z t (k) as the outcome of a push-sum algorithm, initialized at time t, and apply Theorem 6. This immediately yields the following result, with part (b) an immediate consequence of part (a).

Robust Asynchronous Stochastic Gradient-Push (RASGP)
In this section we present the main contribution of this paper, a distributed stochastic gradient method with asymptotically network-independent and optimal performance over directed graphs which is robust to asynchrony, delays, and link failures.
Recall that we are considering a network G of n agents whose goal is to cooperatively solve the following minimization problem where each f i : ℝ d ℝ is a strongly convex function only known to agent i. We assume agent i has the ability to obtain noisy gradients of the function f i .
The RASGP algorithm is given as Algorithm 3. Note that we use the notation g i (k) for a noisy gradient of the function f i (z) at z i (k) i.e., g i (k) = g i (k) + ε i , where g i (k) ≔ ∇f i (z i (k)) and ε i is a random vector.
The RASGP is based on a standard idea of mixing consensus and gradient steps, first analyzed in Nedic and Ozdaglar (2009). The push-sum scheme of Section 2, inspired by Hadjicostis et al. (2016), is used instead of the consensus scheme, which allows us to handle delays, asynchronicity, and message losses; this is similar to the approach taken in Nedic and Olshevsky (2015). We note that a new step-size strategy is used to handle asynchronicity: when a node wakes up, it takes steps with a step-size proportional to the sum of all the step-sizes during the period it slept. As far as we are aware, this idea is new.
We will be making the following assumption on the noise vectors.
Assumption 2 ε i is an independent random vector with bounded support, i.e., ‖ε i ‖ ≤ b i , i = 1, …, n. Moreover, E[ε i ] = 0 and E[ ε i Next, we state and prove the main result of this paper, which states the linear convergence rate of Algorithm 3.

Theorem 15
Suppose that:

2.
Each objective function f i (z) is μ i -strongly convex over ℝ d .
Then, the RASGP algorithm with the step-size α(k) = n/(μk) for k ≥ 1 and α(0) = 0, will converge to the unique optimum z* with the following asymptotic rate: for all i = 1, …, n, we have x i x i − β i (k)g i (k); x i x i + ∑ j ∈ N i − ρ ij * x − ρ ij x , y i y i + ∑ j ∈ N i − ρ ij * y − ρ ij y ; 18: ρ ij x ρ ij * x , ρ ij y ρ ij * y

Remark 16
We note that each agent stores variables x i , y i , κ i , z i , ϕ i x , ϕ i y and ρ ij x , ρ ij y , κ ij for all in-neighbors j ∈ N i − . Hence, the memory requirement of the RASGP algorithm for each We next turn to the proof of Theorem 15. First, we observe that Algorithm 3 is a specific case of multi-dimensional Perturbed Robust Asynchronous Push-Sum. In other words, each coordinate of vectors x i , z i , ϕ i x and ρ ij x will experience an instance of Algorithm 2. Hence, there exists an augmented graph sequence {ℋ(k)} where the Algorithm 3 is equivalent to perturbed push-sum consensus on ℋ(k) where each agent a ℎ ∈ V A holds vectors x h and y h .
In other words, we will be able to apply Theorem 14 to analyze Algorithm 3.
Our first step is to show how to decouple the action of Algorithm 3 coordinate by coordinate. For each coordinate 1 ≤ ℓ ≤ d, let χ l ∈ ℝ n + m′ stack up the ℓth entries of x-values of all agents (virtual and non-virtual) in V A . Additionally, define Δ ℓ (k) ∈ ℝ n + m′ to be the vector stacking up the ℓth entries of perturbations. i.e., Then, by the definition of the algorithm, we have for all ℓ = 1, …, d, These equations write out the action of Algorithm 3 on a coordinate-by-coordinate basis.
In order to prove Theorem 15, we need a few tools and lemmas. As already mentioned, our first step will be to argue that Algorithm 3 converges by application of Theorem 14. This requires showing the boundedness of the perturbations Δ ℓ (k), which, as we will show, reduces to showing the vectors z i (k) are bounded. The following lemma will be useful to establish this boundedness.
Lemma 17 (Nedic and Olshevsky, 2016, Lemma 3) Let q: ℝ d ℝ be a ν-strongly convex function with ν > 0 which has Lipschitz gradients with constant L. let v ∈ ℝ d and u ∈ ℝ d defined by, where α ∈ (0,ν/8L 2 ] and p: ℝ d ℝ d is a mapping such that, Then, there exists a compact set S ⊂ ℝ d and a scalar R such that, We now argue that the iterates generated by Algorithm 3 are bounded.

Lemma 18
The iterates z i (k) generated by Algorithm 3 will remain bounded.
Since the perturbations are only added to the non-virtual agents, which have strictly positive y-values, we conclude [u ℓ (k)] h = 0 if ψ h (k) = 0. Hence, the assumptions of Lemma 8 and Corollary 9 are satisfied. Adopting the definition of I k and P(k) from previous sections, we get for i ∈ I k+1 .
where u j (k) ∈ ℝ d is created by collecting the jth entries of all u ℓ (k), i.e., u i (k) = x i (k) − β i (k)g i (k), if i ∈ V and τ i (k) = 1, x i (k), otherwise.
Now consider each term on the right hand side of (29) for j ∈ I k . Suppose j ≤ n and τ j (k) = 1, then we have: u j (k) y j (k) = z j (k) − β j (k) y j (k) ( ∇f j (z j (k)) + ε j (k)) .
Since lim k→∞ α(k) = 0 and k − κ i (k) = 0 and k − κ i (k) ≤ Γ u , lim k→∞ β j (k) = 0. Moreover, by Lemma 11, y j (k) is bounded below; thus, lim k→∞ β j (k)/y j (k) = 0 and there exists k j such that for k ≥ k j , , β j (k)/y j (k) ∈ 0, μ j /8L j 2 . Applying Lemma 17, it follows that for each j there exists a compact set S j and a scalar R j such that for k ≥ k j , if τ j (k) = 1, Moreover, if τ j (k) = 0 or j > n we have, u j (k) y j (k) = z j (k) .
Let k z ≔ max i k i . Using mathematical induction, we will show that for all k ≥ k z : where R ≔ max{max i R i , max j ∈ I k z z j (k z ) }. Equation (32) holds for k = k z . Suppose it is true for some k ≥ k z . Then by (30) and (31) we have, Also by (29), for i ∈ I k+1 , z i (k + 1) is a convex combination of u j (k)/y j (k)'s, where j ∈ I k . Hence, Define B z ≔ max{R, max i ∈ I k , k < k z z i (k) } and we have ‖z i (k)‖ ≤ B z , ∀k ≥ 0. ■ We next explore a convenient way to rewrite Algorithm 3. Let us introduce the quantity w i (k), which can be interpreted as the x-value of agent i, if it performed a gradient step at every iteration, even when asleep: Also, define w ℓ ∈ ℝ n + m′ by collecting the ℓth dimension of all w i 's and w(k) ≔ (∑ i = 1 n + m′ w i (k))/n. Moreover, define g ℓ ∈ ℝ n + m′ by collecting the ℓth value of gradients of all agents (0 for virtual agents), i.e., Additionally, define ε i (k) ∈ ℝ d as the noise injected to the system at time k by agent i, i.e., ε i (k) = β i (k)ε i (k), if i ∈ V and τ i (k) = 1, 0, otherwise, and ε l (k) ∈ ℝ n + m′ as the vector collecting the ℓth values of all ε i (k)'s.
We then have the following lemma.
Lemma 19 Proof We consider two cases: • If τ i (k) = 0, then (35) reduces to w i (k + 1) = w i (k) − α(k)g i (k); noting that, because node i did not update at time k we have that g i (k) = g i (k + 1) and this is the correct update.
• For all other nodes (i.e., for both virtual nodes and nodes with τ i (k) = 1, we have (28). Since χ ℓ (k+1) = M(k)(χ ℓ (k) + Δ ℓ (k)) and, using the definition of w i (k), we have that for these nodes, w i (k + 1) = x i (k + 1); (28) implies the conclusion. ■ This lemma allows us to straightforwardly analyze how the average of w(k) evolves. Indeed, summing all the elements of (35) and dividing by n for each ℓ = 1, …, d we obtain, We next give a sequence of lemmas to the effect that all the quantities generated by the algorithm are close to each other over time. Define, where, recall, V A is our notation for all the nodes in the augmented graph (i.e., including virtual nodes). Moreover, we will extend the definition of from Line 4 of Algorithm 3 to all k via the same formula β i (k) ≔ ∑ t = κ i (k) + 1 k α(t). Our first lemma will show that each z i (k) closely tracks x(k).
Since Γ u ≥ 1, we obtain k ≤ (κ i (k) + 1)Γ u , or, Proof By definition of wwe have, Using (38) we have, where M i was defined through (39). ■ We next remark on a couple of implications of the past series of lemmas.

Lemma 23
Proof Since ∇f i is L i -Lipschitz, we have, Using Corollary 22, the lemma is proved. ■ We are now in a position to rewrite Algorithm 3 as a sort of perturbed gradient descent. Let us define, η(k) ≔ 1 μk ∑ i = 1 n (g i (k) − ∇f i (w(k))) .
In other words, with the exception of the η(k) term, what we have is exactly a stochastic gradient descent method on the function F(⋅).
The following lemmas bound ε(k). Let us define ν i (k) = k − κ i (k) as the number of iterations agent i has skipped since it's last update. By Assumption 1, ν i (k) ≤ Γ u .

Lemma 24
We have β i (k) = O k (1/k), ∀i. Moreover, where the second equality is the result of the noise terms being independent and zero-mean. ■ Our next observation is a technical lemma which is essentially a rephrasing of Lemma 17 above.
Lemma 27 There exists a constant B w and time k w such that w(k) ≤ B w with probability one, for k ≥ k w .
Proof We have where μk ε(k) + η(k) is bounded. Moreover, there exists k w such that for k ≥ k w , 1 μk ∈ 0, μ/8L 2 . Therefore, by Lemma 17 there exists a compact set S w and a scalar R w > 0 such that for k ≥ k w , w(k + 1) ≤ w(k) , for w ∈ S w , R w , for w ∈ S w .
Therefore, setting B w ≔ max{R w , w(k w ) } will complete the proof. ■ As a consequence of this lemma, because η(k) 2 ≤ B η , this lemma implies there is a constant B 1 such that for k ≥ k w , with probability one. This now puts us in a position to show that w(k) converges in mean square to the optimal solution.
We will bound each of the terms on the right. We begin with the easiest one, which is the last one: The middle term is bounded as where we used (41).
This lemma is stated and proved for t′ = 1 in (Rakhlin et al., 2012, Lemma 3), and the case of general t′ follows immediately.
We are almost ready to complete the proof of Theorem 15; all that is needed is to refine the convergence rate of w(k) to x*. Now as a consequence of (45)  Furthermore, by the finite support of μkε(k), by Corollary 25, we also have that We now use these observations to provide a proof of our main result.

Proof of Theorem 15
Essentially, we rewrite the proof of Lemma 28, but now using the fact that E[ w(k) − z * ] = O k (1/ k) from (46). This allows us to make two modification to the arguments of that lemma. First, we can now replace (43) by where we used (47) To save space, let us define a k ≔ E[ w(k) − z * 2 ]. Multiplying both sides of relation above by k 2 we obtain, a k + 1 k 2 ≤ a k 1 − 2 k k 2 + E[ ε(k) 2 ]k 2 + O k (k −0.5 ) .
Summing the relation above for k = 0, …, T implies, Now, let us estimate the first term on the right hand side of relation above, where we used Lemma 24 in the last equality. Define t i (j) as the j'th time agent i has woken up, and set t i (0) = −1. Then we can rewrite the relation above as, Combining relations above and then dividing both sides by T 2 we obtain, We next argue that the same guarantee holds for every z i (k). Indeed, for each i = 1, …, m, z i (k) − z * 2 = z i (k) − w(k) + w(k) − z * 2 = z i (k) − w(k) 2 + 2 z i (k) − w(k) w(k) − z * + w(k) − z * 2 .
Now from Corollary 22, we know that with probability one, z i (k) − w(k) 2 = O k (1/k).
Taking expectation of both sides and using (49)  Putting this together with (49) completes the proof. ■

Time-Varying Graphs
We remark that Theorems 6, 14 and 15 all extend verbatim to the case of time-varying graphs with no message losses. Indeed, only one problem appears in extending the proofs in this paper to time-varying graphs: a node i may send a message to node j; that message will be lost; and afterwards node i never sends anything to node j again. In this case, Lemmas 7 and 11 do not hold. Indeed, examining Lemma 11, we observe what can very well happen is that all of χ i (k) and ψ i (k) are "lost" over time into messages that never arrive. However, as long as no messages are lost, the proofs in this paper extend to the time-varying case verbatim. On a technical level, the results still hold if u ij x (k) = 0 , u ij y (k) = 0 (virtual node c ij ∈ V A holds no lost message), when link (i,j) is removed from the network at time k, and the graph G stays strongly connected (or B-connected, i.e., there exists a positive integer B such that the union of every B consecutive graphs is strongly connected).

On the Bounds for Delays, Asynchrony, and Message Losses
It is natural to what extent the assumption of finite upper bounds on delays, asynchrony, and message losses are really necessary. A natural example which falls outside our framework is a fixed graph G, where, at each time step, every link in G appears with probability 1/2. A more general model might involve a different probability p e of failure for each edge e.
We observe that our result can already handle this case in the following manner. For simplicity, let us stick with the scenario where every link appears with probability 1/2. Then the probability that, after time t, some link has not appeared is at most m(1/2) t , where m is the number of edges in G. This implies that if we choose B = O(log(mnT)), then with hight probability, the sequence of graphs G 1 ,…G T is B-connected.
Thus our theorem applies to this case, albeit at the expense of some logarithmic factors due to the choice of B. We remark that it is possible to get rid of these factors by directly analyzing the decrease in E[ z(t) − z * 2 2 ] coming from the random choice of graph G. Since our arguments are already quite lengthy, we do not pursue this generalization here, and refer the reader to Lobel and Ozdaglar (2010); Srivastava and Nedic (2011) where similar arguments have been made.

Setup
In this section, we simulate the RASGP algorithm on two classes of graphs, namely, random directed graphs and bidirectional cycle graphs. The main objective function is chosen to be a strongly convex and smooth Support Vector Machine (SVM), i.e. F (ω, γ) = 1 2 ( ω 2 + γ 2 ) + C N ∑ j = 1 N ℎ(b j (A j ⊺ ω + γ)) where ω ∈ ℝ d − 1 and γ ∈ ℝ are the optimization variables, and A j ∈ ℝ d − 1 , b j ∈ {−1, + 1}, j = 1, …, N are the data points and their labels, respectively. The coefficient C N ∈ ℝ penalizes the points outside of the soft margin. We set C N = c/N, c = 500 in our simulations, which depends on the total number of data points. Here, ℎ: ℝ ℝ is the smoothed hinge loss, initially introduced in Rennie and Srebro (2005), defined as follows: To solve this problem in a distributed way, we suppose all data points are spread among agents. Hence, the local objective functions are f i (ω i , γ i ) = 1 2n ( ω 2 + γ 2 ) + C N ∑ j ∈ D i ℎ(b j (A j ⊺ ω + γ)), where D i ⊂ {1,2, …, N} is an index set for data points of agent i and N is the total number of data points. We choose the size of the data set for each local function to be a constant (|D i | = 50), thus N = 50n. It is easy to check that each f i has Lipschitz gradients and is strongly convex with μ i = 1/n.
We will compare our results with a centralized gradient descent algorithm, which updates every Γ u iterations using the step-size sequence α c (k) = Γ u /(μk), in the direction of the sum of the gradients of all agents.
To make gradient estimates stochastic, we add a uniformly distributed noise Agents wake up with probability P w and links fail with probability P f , unless they reach their maximum allowed value where the algorithm forces the agent to wake up or the link to work successfully. The link delays are chosen uniformly between 1 to Γ del .
Each data set D i is synthetically generated by picking 25 data points around each of the centers (1,1) and (3,3) with multivariate normal distributions, labeled −1 and +1, respectively. In generating strongly connected random graphs, we pick each edge with a probability of 0.5 and then check if the resulting graph is strongly connected; if it isn't, we repeat the process. Since the initial step-sizes for the distributed algorithm can be very large (e.g., α(1) = 50 for n = 50), to stabilize the algorithms, both algorithms are started with k 0 = 100. This wouldn't affect the asymptotic convergence performance. Moreover, the initial point of the centralized algorithm and all agents in RASGP are chosen as 1 d .
Let us denote by z(k) ≔ (1/n)∑ i = 1 n z= i (k) the average of z-values of non-virtual agents.
Then, we define optimization errors E dist ≔ z(k) − z * 2 and E c (k) ≔ x c (k) − z * 2 for RASGP and Centralized stochastic gradient descent, respectively.
Since our performance guarantees are for the expectation of (squared) errors, for each network setting, we perform up to 1000 Monte-Carlo simulations and use their corresponding performance to estimate the average behavior of the algorithms. Since accurately estimating the true expected value requires an extremely large number of simulations, in order to alleviate the effect of spikes and high variance, we take the following steps. First a batch of simulations are performed and their average is calculated. Next, to obtain a smoother plot, an average over every 100 iterations is taken. And finally, the median of these outputs over all the batches is our estimate of the expected value.
We report two figures for each setting: one including the errors E dist and E c , and another one including k × E dist and k × E c to demonstrate the convergence rates.
Finally, to study the non-asymptotic behavior of RASGP and its dependence on network size n, we have compared the performance of the centralized stochastic gradient descent and RASGP over a bidirectional cycle graph, with error variances of n 2 σ 2 and σ i 2 = σ 2 , respectively. Then, we plot the ratio E c (k)/E dist (k) over n, for different iterations k.

Results
Our simulation results are consistent with our theoretical claims (due to the performance of centralized and decentralized methods growing closer over time) and show the achievement of an asymptotic network-independent convergence rate. Fig. 3 shows that when there is no link failure or delay and all agents wake up at every iteration (Γ s = 2), RASGP and centralized gradient descent have very similar performance. When we allow links to have delays and failures (see Fig. 4), as well as asynchronous updates (see Fig. 5), it takes longer for RASGP to reach its asymptotic convergence rate.
We observe that, with all the other parameters fixed, the RASGP performs better on a random graph than on a cycle graph (see Figs. 5 and 6). A possible reason is that the cycle graph has a higher diameter or mixing time compared to the random graph, resulting in a slower decay of the consensus error.
We notice that by fixing the network size, increasing the number of iterations brings us closer to linear speed-up (see Fig. 7). On the other hand, when fixing the number of iterations, increasing the number of nodes, after a certain point, does not help speeding up the optimization. Moreover, by allowing link delays and failures (see Fig. 7b) we require more iterations to achieve network independence.

Conclusions
The main result of this paper is to stablish asymptotically, network independent performance for a distributed stochastic optimization method over directed graphs with message losses, delays, and asynchronous updates. Our work raises several open questions.
The most natural question raised by this paper concerns the size of the transients. How long must the nodes wait until the network-independent performance bound is achieved? The answer, of course, will depend on the network, but also on the number of nodes, the degree of asynchrony, and the delays. Understanding how this quantity scales is required before the algorithms presented in this work can be recommended to practitioners.
More generally, it is interesting to ask which problems in distributed optimization can achieve network-independent performance, even asymptotically. For example, the usual bounds for distributed subgradient descent (see, e.g., Nedic et al., 2018) depend on the spectral gap of the underlying network; various worst-case scalings with the number of nodes can be derived, and the final asymptotics are not network-independent. It is not immediately clear whether this is due to the analysis, or a fundamental limitation that will not be overcome. LHS = u ij x (K + 1) + ρ ji x (K + 1) + ∑ l = 1 Γ d x ij l (K + 1), RHS = ∑ l = 1 Γ d x ij l (K) + ρ ji x (K) + v ij x (K) = ∑ l = 1 Γ d x ij l (K) + ρ ji x (K) + u ij x (K) − ϕ i x (k) + ϕ i x (K + 1) = ϕ i x (K + 1) .
The last equality holds because of the induction hypothesis (13) for k = K − 1, hence completing the proof. ■